Description

FAST REGULAR MULTIPLIER ARCHITECTURE 

TECHNICAL FIELD 

The present invention relates to electrical 
digital circuits for performing binary multiplication by 
sum of cross products, i.e. parallel multipliers, and in 
particular relates to the architecture of such a multi- 
plication circuit's arrangement of adders for summing the 
partial products. Architectures optimized for minimum 
circuit area and/or maximum operating speed are 
especially relevant. Multipliers with balanced signal 
propagation delays for minimizing spurious transitions 
are also relevant. 

BACKGROUND ART 

A multiplication circuit or multiplier consists 
mainly of three parts: (1) a partial product generator 
made up of a matrix of AND logic gates, each operating on 
one bit of a multiplicand and one bit of a multiplier 
(here, the number, as opposed to the circuit), (2) a
multiplier array (also called an adder array) made up of 
columns of adders which reduce the partial products by 
summation to two words, usually called the "sum" word and 
the "carry" word, and (3) a vector merging adder for 
adding the sum and carry words to result in one output 
word, the product. When multiplying two binary numbers, 
an M-bit multiplicand and an N-bit multiplier, M x N 
partial product terms are usually generated (although 
there may be some additional terms to handle negative 
numbers), which could alternatively be thought of as N
M-bit partial products, and the resulting product 
generally has M + N bits. In most multiplication 
circuits, both multiplicand and multiplier are of the 
same N-bit size, and the product is therefore 2N bits 
wide. 
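
By way of illustration only, the partial product generator and
the summation it feeds can be sketched in software as follows.
The function names are hypothetical and the sketch models the
arithmetic only, not the adder array or its timing.

```python
def partial_products(a: int, b: int, m: int, n: int) -> list[list[int]]:
    """Return n rows of m AND terms; rows[j][i] is a_i AND b_j, weight 2**(i+j)."""
    return [[((a >> i) & 1) & ((b >> j) & 1) for i in range(m)]
            for j in range(n)]


def sum_of_cross_products(rows: list[list[int]]) -> int:
    """Add every partial-product bit at its column weight. In hardware this is
    the job of the adder array followed by the vector merging adder."""
    return sum(bit << (i + j)
               for j, row in enumerate(rows)
               for i, bit in enumerate(row))


rows = partial_products(13, 11, m=4, n=4)    # M x N = 16 AND terms
product = sum_of_cross_products(rows)        # equals 13 * 11, fits in M + N = 8 bits
```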
Multiplication circuits, when used in digital
signal processors, are combined with an accumulator, so 
that digital filtering and other signal processing 
functions can be readily performed. The basic operation 
is ACC := ACC + (A*B), or ACC := ACC - (A*B). That is,
typically the accumulator will add or subtract the result 
of the multiplication to the previous accumulated value. 
The accumulator is typically P bits wide, where P > 2N, 
2N bits is the width of the multiplier product, and the 
leftmost (most significant) P-2N bits, called guard bits, 
are there to prevent overflow. U.S. Pat. No. 4,575,812 
to Kloker et al. describes one such multiplier/accumulator
circuit. A straightforward implementation of a 
multiplier/accumulator circuit has the accumulator adder 
follow the vector merging adder of the multiplier, so 
that a first addition adds the sum and carry words to 
form the multiplication product and then follows this 
with a second addition of that product with the value in 
the accumulator. Alternatively, the accumulator could be 
integrated with the multiplier by adding an extra row of 
adders to the multiplier array and providing the two word 
result to the vector merging adder. Since only one final 
adder has to be provided, this simplifies the design 
effort, and will also improve speed somewhat. 
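
The role of the guard bits can be illustrated with a small
behavioral sketch. The widths used here (N = 16, P = 32 and
P = 40) and the function name mac are assumptions chosen for
the example, not values taken from any cited circuit.

```python
def mac(acc: int, a: int, b: int, p_bits: int, subtract: bool = False) -> int:
    """One step of ACC := ACC +/- (A*B) in a P-bit two's-complement register."""
    acc = acc - a * b if subtract else acc + a * b
    acc &= (1 << p_bits) - 1              # keep only P bits (wraparound)
    if acc >> (p_bits - 1):               # reinterpret the top bit as the sign
        acc -= 1 << p_bits
    return acc


N = 16
worst = -(1 << (N - 1))                   # most negative N-bit operand

# P = 2N = 32: no guard bits, so two worst-case products already wrap around.
no_guard = mac(mac(0, worst, worst, 32), worst, worst, 32)

# P = 40: eight guard bits absorb 256 worst-case products without overflow.
guarded = 0
for _ in range(256):
    guarded = mac(guarded, worst, worst, 40)
```

With no guard bits the running sum changes sign after the second
product, whereas the wider accumulator holds the exact total.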

Regardless of whether a multiplier alone or a 
combined multiplier/accumulator circuit is being 
considered, the critical path that determines operating 
speed consists of delay through the multiplier array and 
delay through the final adder (plus any delay through a 
separate accumulator adder). The multiplier is the
slowest part of a digital signal processor, so any 
improvement in the speed of the multiplier will improve 
the overall speed of the processor. High speed 
processing is required, for example, for implementing 
sophisticated speech and channel coding algorithms for 
digital cellular telephone communication. Another factor 
is layout area and regularity. A regular floorplan is
easy to design and lay out, whereas an irregular floorplan
takes considerably more time and effort to lay out. The
choice of a multiplier architecture usually involves
tradeoffs between area and speed. Tree multiplier
architectures have a delay proportional to O(log N),
whereas array multiplier architectures have a delay
proportional to O(N) (where N is the word length in
bits). Thus, tree architectures are faster. However,
because tree multipliers require large shifts of data
perpendicular to the data path, their implementation is
routing intensive, requiring a larger circuit area than
array multipliers. Tree architectures also tend to be
very irregular in their layout.
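
The difference in growth rates can be made concrete with a
rough count of adder levels. The sketch below assumes that
only three-to-two (full adder) reduction is used and ignores
wiring delay; both function names are hypothetical.

```python
def wallace_levels(n_operands: int) -> int:
    """Full-adder levels needed to reduce n operands to two words when every
    group of three rows is compressed to two rows at each level."""
    levels = 0
    while n_operands > 2:
        n_operands = 2 * (n_operands // 3) + n_operands % 3
        levels += 1
    return levels


def array_rows(n_operands: int) -> int:
    """Carry-save rows in a linear array: the first row absorbs three
    operands and every later row absorbs one more."""
    return max(n_operands - 2, 0)


tree_17, array_17 = wallace_levels(17), array_rows(17)
```

For 17 partial product rows this count gives 6 levels for the
tree against 15 rows for the array.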

In U.S. Pat. Nos. 5,343,417 and 5,586,071,
Flora describes a Wallace tree multiplier architecture in
which the columns of full adders and half adders that are
used in the multiplier to reduce the partial products by
successive addition to sum and carry words are chosen so
that the particular inputs to be added at each adder
level comply with prescribed rules that enhance the
multiplier's operating speed. U.S. Pat. Nos. 5,181,185
to Han et al. and 5,504,915 to Rarick disclose other high
speed parallel multipliers employing modified Wallace
tree adders for summing the columns of partial products.
All of these disclosed multiplication circuits illustrate
the basic layout irregularity that is characteristic of
tree multiplier architectures. The modified Wallace
trees sacrifice some speed to obtain greater layout
regularity as compared with pure Wallace tree
architectures.

U.S. Pat. No. 4,901,270 to Galbi et al., and an
article by G. Goto et al. in IEEE Journal of Solid-State
Circuits, vol. 27, no. 9, September 1992, pages 1229-
1234, describe use of four-to-two compressor adders in
tree multipliers for further improving their speed. In
U.S. Pat. No. 5,347,482, Williams discloses that using
nine-to-three adders in a Wallace tree simplifies layout
and signal routing because of the larger basic building
blocks of the tree, yet operates in the same number of
adder delays as a three-to-two (full) adder. In U.S.
Pat. No. 5,265,043, Naini et al. disclose a Wallace tree
multiplier architecture that is provided with its
carry-save adders arranged in an L-fold layout or
floorplan in order to improve that architecture's layout
regularity and reduce the required layout area.

G. J. Hekstra et al., in "A Fast Parallel
Multiplier Architecture", Proceedings of IEEE Symposium
on Circuits and Systems, pages 2128-2131, 1992, describe
a regular array architecture with a delay proportional to
O(√N). Thus, it offers an alternative to the compact
and regular, but slow, array multiplier architecture and
to the fast, but irregular and large circuit area, tree
multiplier architectures, like the Wallace tree
multiplier. The Hekstra multiplier architecture has an
"array of arrays"-based structure consisting of a number
of subarrays producing a series of partial sums feeding
into a main array adding the partial sums to form the
product. The main array stages consist of two rows of
full adders in a four-to-two reductor configuration. The
subarrays consist of rows of full adders together with
the partial product generators. The sizes of the
subarrays vary and have been carefully chosen to balance
the propagation delays so that addends arrive at a main
array stage simultaneously with the previous stage's
partial sum. In Hekstra's implementation, this occurs
when the sizes of the subarrays, i.e. the number of full
adder rows, increase in steps of two from one subarray to
the next.

An article by T. Sakuta et al. in IEEE
Symposium on Low Power Electronics: Digest of Technical
Papers, pages 36-37, October 1995, highlights the
importance of delay balancing in order to minimize spurious
transitions and thereby to minimize unnecessary power
dissipation. Adders start computing at the same time
without waiting for the propagation of sum and carry
signals from a previous stage, so that if the addends do
not arrive simultaneously at an adder, spurious
transitions will result. These spurious transitions also
propagate to subsequent stages, resulting in a growing
number of transitions from one stage to the next.
Conventional array multiplier architectures are
inherently unbalanced, and thus tend to consume a lot of
power. In contrast, Wallace-tree multipliers are
naturally balanced due to their inherent parallel
structure, and thus have a lower probability of
occurrence of spurious transitions. Delay circuits could
be inserted into the signal paths of any product term
inputs that skip an adder ladder to synchronize them with
the other inputs of corresponding adders, as taught by T.
Sakuta et al. As for the aforementioned Hekstra
architecture, that multiplier happens to be delay
balanced only because of an appropriate selection of
subarray sizes.
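
The origin of a spurious transition can be seen in a toy
model of a single XOR gate, such as the one forming the sum
bit of an adder. The model below is an assumption made for
illustration (integer time steps, both inputs rising from 0
to 1, zero gate delay) and is not taken from the cited
article.

```python
def xor_output_transitions(arrival_a: int, arrival_b: int) -> int:
    """Count output toggles of an XOR gate whose two inputs each rise from
    0 to 1 at the given integer times."""
    a = b = 0
    out = a ^ b
    toggles = 0
    for t in range(max(arrival_a, arrival_b) + 1):
        if t == arrival_a:
            a = 1
        if t == arrival_b:
            b = 1
        new = a ^ b
        if new != out:          # the output changed at this time step
            toggles += 1
            out = new
    return toggles


balanced = xor_output_transitions(3, 3)   # inputs arrive together
skewed = xor_output_transitions(0, 2)     # one input arrives two steps late
```

In the balanced case the output never moves, since its final
value equals its initial value. In the skewed case it toggles
twice to reach the same final value, and both toggles would be
passed on to the following stage.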

Although the Hekstra-type multiplier
architecture is very regular in comparison with the Wallace and
other tree architectures and nearly as compact as a
conventional array multiplier, and is also much faster
than an array multiplier, it is still somewhat slower
than the tree multiplier architectures. Because of their
naturally balanced parallel structure, it has been
relatively easy to incorporate four-to-two, nine-to-three
and other compressor adder structures into the tree
multipliers without destroying their balanced signal
propagation, in order to increase their operating speed.
Moreover, modified tree architectures and hybrid
tree-array architectures have allowed designers to improve
regularity and reduce circuit area to a certain extent
without sacrificing too much speed. Accordingly, where
space is not at a premium, tree architectures have become
the design of choice. Where small circuit area is
essential, circuit designers have been forced to cope
with array multipliers, despite their slow speed. The
Hekstra-type multiplier is not well known and has been
generally ignored. Since the one-sided architecture of
adder subarrays feeding into a single main array is not
inherently balanced, but rather balanced only by
construction with a proper selection of subarray sizes,
any modifications would require great care if balance is
to be maintained.

It is an object of the present invention to 
provide a modified Hekstra-type multiplier architecture 
with improved operating speed, without sacrificing 
circuit area and regularity or destroying the delay 
balance. 

DISCLOSURE OF THE INVENTION 

The object has been met with a multiplier 
architecture of the Hekstra-type, that is, one where a 
plurality of adder subarrays feed into a main adder 
array, which has been modified by replacing pairs of full 
adders in the subarrays with four-to-two compressor adder 
circuits, hereafter referred to as compressor circuits, 
in a manner that preserves the balance in the signal 
propagation delays so that partial sums arrive at each 
stage of the main array simultaneously. Two types of
compressor circuits, referred to as symmetric and
asymmetric compressors, are used in different portions of
the multiplier architecture. The asymmetric compressors
are used wherever not all of their inputs are available at
the same time.

BRIEF DESCRIPTION OF THE DRAWINGS 

Figs. 1 and 2 are respective diagrams of 
component interconnection structure and block layout of a
typical prior art tree multiplier architecture. 

Figs. 3 and 4 are respective diagrams of 
component interconnection structure and block layout of a 
modified Hekstra-type multiplier architecture in accord 
with the present invention, arranged side-by-side with 
Figs. 1 and 2 for comparison. 

Fig. 5 is a detailed block schematic diagram of 
a preferred multiplier architecture of the present 
invention showing the components of the architecture's 
multiplier array reducing partial products by summation.
The final vector merging adder is conventional, and is
not shown.

Figs. 6 and 7 are standard algebraic notations
illustrating multiplication by known sum-of-cross-products
algorithms of an m-bit multiplicand and an n-bit
multiplier to form an (m+n)-bit product for respective
unsigned and 2's complement notations. The two's
complement multiplication of Fig. 7 implements the
Baugh-Wooley algorithm disclosed in U.S. Pat. No. 3,866,030,
and is carried out by the preferred multiplication
circuitry of Fig. 5.

Figs. 8-11 are logic gate-level circuit 
diagrams of four-to-two compressor circuits for use in 
the multiplication circuitry of Fig. 5.

Figs. 12 and 13 are diagrams of component 
interconnection structure for two alternate modified 
Hekstra-type multiplier architectures in accord with the 
present invention. 


BEST MODE OF CARRYING OUT THE INVENTION 

With reference to Figs. 1-4, a prior art tree
architecture is presented side-by-side with an
architecture in accord with the present invention so that their
respective structures, routing and propagation delays may
be compared. In Fig. 1, it can be seen that the prior
art structure is a full binary tree, i.e., a Wallace
tree, in which each full adder (F) in an initial level of
adders (level 0) operates on a set of partial products
13, typically three per adder, to produce a partial sum.
Thus, the initial level produces as many partial sums
as there are full adders (F) in level 0 of the
structure. The adders (F) also produce an equal number
of carries that are transferred to level 1 of a similar
tree structure responsible for summing partial products
of the next higher significance level for the binary
product. In Fig. 1, level 1 consists of a set of 4-to-2
compressor circuits such as those described by Goto et
al., in IEEE Journal of Solid-State Circuits, vol. 27,
no. 9, September 1992, pages 1229-1235. Each compressor
circuit carries out the operations of two full adders in
series but has a propagation delay of about 1.5 times one
full adder delay. Two full adders could be used, if
desired. Each compressor circuit (C) in level 1 takes
four inputs from level 0, such as two partial sums output
by two full adders (F) in level 0 in the same tree and
two carries from equivalent level 0 full adders in the
tree responsible for summing the partial products of next
lower significance level of the binary product. Each
level 1 compressor circuit (C) also receives another
carry from the corresponding level 1 compressor in the
next lower significance summing tree. The level 1
compressor circuit (C) generates a carry for the
corresponding level 1 compressor in the next higher
significance summing tree and a second carry for a level
2 compressor in the next higher significance summing
tree. It also generates a partial sum for a level 2
compressor in the same tree as itself. Compressors in
levels 2 and 3 operate in a similar manner. In this way,
each tree reduces partial products of the same
significance level (together with carries from the next lower
significance summing tree) to a final sum and a final
carry. Each successive level reduces in half the number
of partial sums, so that the number of levels required
(and hence the propagation delay) is on the order of
log(N), where N is the number of partial products to be
summed. The tree in Fig. 1 is capable of handling up to
24 partial products (8 full adders times 3 partial
products per adder).
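
The figures quoted for this tree can be reproduced with a
short calculation. The sketch below assumes the number of
level 0 full adders is a power of two, and it takes the
compressor delay as exactly 1.5 full adder delays where the
description says "about 1.5". The function name is
hypothetical.

```python
def tree_summary(level0_adders: int, fa_delay: float = 1.0,
                 compressor_delay: float = 1.5) -> tuple[int, int, float]:
    """Capacity, number of compressor levels and delay of a Fig. 1 style tree.

    Assumes level0_adders is a power of two, so that each 4-to-2 level halves
    the number of partial sums until a single compressor remains.
    """
    capacity = 3 * level0_adders                         # three partial products per adder
    compressor_levels = level0_adders.bit_length() - 1   # log2 for powers of two
    delay = fa_delay + compressor_levels * compressor_delay
    return capacity, compressor_levels, delay


capacity, levels, delay = tree_summary(8)
```

For eight level 0 adders this gives a capacity of 24 partial
products, three compressor levels (levels 1 to 3) and a delay
of 5.5 full adder delays.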

One problem with such tree structures occurs
when attempting to lay out such an architecture in a
somewhat regular manner. Because the structure is
tree-like, it is difficult to fit into a rectangular shape.
In Fig. 2, the tree of Fig. 1 responsible for a single
bitwise significance level in the final product is
arranged in a linear fashion so that adjacent trees can
be arranged side-by-side to facilitate transfer of the
carry signals from one bit-column tree to the next. Each
block or cell in Fig. 2 represents either a full adder
(F) or a compressor circuit (C). As previously
mentioned, pairs of full adders could be used instead of
the compressor circuits. Each cell in Fig. 2 also
indicates the level to which it belongs (L0, L1, L2, L3).
The transfer of partial sums to the next level is
indicated by the arrows between cells. It can be seen
that the tree architecture poses a serious routing
problem. Only half of the connections between cells are
local whereas the other half require routing through one
or more intervening cells. With each extra level added
to the tree hierarchy, the length of nonlocal wires
doubles, so that whereas connection of level 0 cells and
level 1 cells requires nonlocal wires 15 that are two
cells long, some connections between levels 1 and 2
require nonlocal wires 17 that are four cells long and
certain connections between levels 2 and 3 require wiring
19 which is eight cells long. Moreover, with each extra
level in the hierarchy, two additional routing tracks
through cells have to be provided. The numbers to the
right of each cell in Fig. 2 show the number of
cell-to-cell wires that pass through that cell. Different cells
have different numbers of crossing tracks for wires to
pass through depending on their position in the line of
cells, with the later cells tending to require more
tracks. This situation requires extra layout effort,
because each level in the hierarchy will require a
different layout topology. The widths of the cells
vary according to the number of wiring tracks they must
accommodate. There are several blocks of cells that have
two full adders (F) followed by one compressor circuit
(C). However, blocks 1, 2 and 3 are all of different
layout type, since the different blocks require different
numbers of routing tracks.

Fig. 3 shows an architecture in accord with the
present invention. This architecture has a sequence of
successively longer chains (CSA0, CSA1, CSA2, CSA3, CSA4)
of adders producing partial sums that feed into a series
of main adder stages (MS1, MS2, MS3, MS4). The structure
is a connection of carry save arrays. Two such subarrays
(CSA0 and CSA1) each consist of one full adder cell for
each column of partial products and supply partial sums
to a first main stage adder MS1. All of the main stage
adders are four-to-two compressor circuits. The output
of the first main stage adder MS1 and the partial sum
provided by yet another subarray CSA2 are input into a
second main stage adder MS2. In order to maintain the
proper delay balance, subarray CSA2 consists of a full
adder cell (F) and a compressor circuit (C) so that the
partial sum generated by the subarray CSA2 arrives
simultaneously with that of first main stage MS1 at the
second main stage adder MS2. The output of the second
main stage adder MS2 and the partial sum output provided
by a subarray CSA3 are input into a third main stage
adder MS3. Again, to maintain proper delay balance, the
subarray CSA3 consists of a full adder (F) and two
compressor circuits (C) to match the propagation delay
through the second main stage MS2. This sequence can
continue to arbitrarily large structures, with each step
in size including another main stage (e.g. MS4) and
another subarray (e.g. CSA4), where for proper balance,
the successive carry save arrays making up the subarrays
feeding into the main stage adders increase in size by
one compressor circuit per subarray. Thus, subarray CSA4
would consist of a full adder stage (F) and three
compressor stages (C). Another difference, necessitated
by the one-sided nature of the "branching" in the
structure, is that the compressor circuits (C) for the
main stages (MS1, MS2, MS3, MS4) be symmetric circuits,
since all inputs naturally arrive simultaneously if the
subarray sizes are chosen correctly, but that at least
some of the compressor circuits (C) in the subarrays
(CSA2, CSA3, CSA4) be asymmetric circuits, since their
partial product inputs would normally arrive earlier than
the partial sums output by the preceding stage of the
subarrays. Additional delay circuits could be included
like those mentioned in the article of T. Sakuta et al.
cited previously. More detailed description of the
symmetric and asymmetric compressors will be provided
below with reference to Figs. 8-11.
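
The delay balance described for Fig. 3 can be checked with a
simple timing model. The sketch below is illustrative only.
It takes a full adder as the unit delay and a compressor as
exactly 1.5 units, ignores wiring, and assumes for the row
count that a full adder absorbs three partial product rows and
each subarray compressor absorbs two further rows. All names
are hypothetical.

```python
FA_DELAY = 1.0   # one full adder delay (the unit)
C_DELAY = 1.5    # four-to-two compressor delay, taken as exactly 1.5 units


def arrival_skews(compressors_per_subarray: list[int]) -> list[float]:
    """compressors_per_subarray[k-1] is the number of compressors that follow
    the full adder in subarray CSA_k. For each main stage MS_k, return the
    arrival-time difference between the previous main-stage result and the
    result of the local subarray."""
    prev = FA_DELAY                          # CSA0 (one full adder) feeds MS1
    skews = []
    for n_comp in compressors_per_subarray:
        local = FA_DELAY + n_comp * C_DELAY
        skews.append(abs(prev - local))
        prev = max(prev, local) + C_DELAY    # main-stage compressor
    return skews


def rows_reduced(main_stages: int) -> int:
    """Partial-product rows absorbed by CSA0 through CSA_k under the stated
    assumption (3 rows per full adder, 2 more per subarray compressor)."""
    return 3 + sum(3 + 2 * (k - 1) for k in range(1, main_stages + 1))


balanced = arrival_skews([0, 1, 2, 3])   # CSA1..CSA4 sized as in Fig. 3
uniform = arrival_skews([0, 0, 0, 0])    # every subarray a single full adder
```

With the subarray sizes of Fig. 3 every skew is zero, while
equal-sized subarrays give a skew that grows at each main
stage. Under the row-count assumption, the number of rows
handled grows roughly with the square of the number of main
stages, which is consistent with the O(√N) delay attributed
to the Hekstra-type structure.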

Turning now to Fig. 4, an advantage of this
modified Hekstra-type structure is seen when the adder
stages are laid out linearly in blocks. Unlike the tree
architecture of Fig. 2, all connections are local, except
the connections from one main stage to the next main
stage, and from subarray CSA0 to first main stage MS1.
Thus, regardless of the total size of the architecture,
i.e. the number of product terms to be reduced and the
number of main stages and subarrays needed to reduce
them, there will never be more than two signal paths
crossing through a subarray cell and all cells can be the
same size to accommodate those signal paths or tracks.
The layout is very regular and only a few different types
of cells are needed, repeated throughout the structure,
thereby simplifying design. The full adders (F) in each
subarray can be identical, the main stage compressor
circuits (C) can be identical, and the subarray
compressor circuits (C) can be identical regardless of
whether they are in subarray CSA2 or CSA3 or stage SA1 or
SA2, etc.

With reference to Fig. 5, a preferred
embodiment of a multiplier circuit of the present invention is
adapted for carrying out 17-bit by 17-bit 2's-complement
binary multiplication, using the Baugh-Wooley algorithm
of U.S. Pat. No. 3,866,030, but with the improved
multiplier architecture of Figs. 3 and 4. In Fig. 5, the
numbers from 0 to 33 on the top and bottom of the figure
refer to the particular bit in the resulting product.
The small rectangular elements with diagonal hatching
refer to product term generators. The differently
hatched rectangular elements immediately above subarray
level SA3. and the solid rectangular elements above
half-adder cells 2C0 and 2C1 are also product terms which are
peculiar to the Baugh-Wooley 2's-complement
multiplication algorithm. All of the product terms are detailed
below in Fig. 7. There are three basic types of adder
cells used in the circuit: half-adders (H), full-adders
(F), and four-to-two compressor circuits (C). Each of
these adders is well known in the art. Further, the
four-to-two compressor circuits (C) are of two types,
asymmetric for at least the subarray stage SA31 in Fig. 5
(which, unlike Figs. 3 and 4, places the compressor
stages SA20, SA30 and SA31 ahead of the full adder stages
SA21 and SA32 of the subarrays CSA2 and CSA3), and in other
configurations for other subarray stages as well, and
symmetric compressor circuits for at least the main array
stages MS1, MS2 and MS3. Construction of these two
compressor types will be discussed below with reference
to Figs. 8-11. Also, half-adders (H) could be replaced
with full adders (F) in which one of the inputs is fixed
at logic level zero. Likewise, a combination of a
full-adder (F) followed by a half-adder (H) within a stage (or
even two half-adders) could be replaced by a compressor
circuit (C) in which one (or two) of the inputs is fixed
at zero. In this way, even more regularity can be
obtained, albeit at the expense of a slightly less
optimal adder cell.

Each cell (H, F or C) generates both a sum term
and a carry term. Representative connections of those
terms to inputs in the main array stages MS1, MS2 and MS3
are shown by the arrows. Each cell of the main stages
receives one sum term output from a previous main stage
(or in the case of main array stage MS1, from subarray
SA00), one carry term output from that same previous main
stage (or subarray SA00), one sum term output from the
subarray stage which is local to it, i.e. the block of
adders immediately above it, and likewise a carry term
from that same local subarray stage. The sum terms come
from adder cells in the same bit column, while the carry
terms come from adder cells of the next lower
significance (i.e., immediately to the right of the cells
supplying the sum terms). Thus, for example, compressor
cell (C) in bit column 18 of main stage MS3 receives a
sum term from the compressor C in bit column 18 of main
stage MS2, a carry term from the compressor C in bit
column 17 of main stage MS2, a sum term from the
half-adder H in bit column 18 of subarray stage SA32, and a
carry term from the full-adder F in bit column 17 of
subarray stage SA32. In some instances, the full
complement of two sum terms and two carry terms is not
available (notably at the far left and far right of most
stages), so a compressor cell C is not needed and a
full-adder/half-adder combination, or even a
half-adder/half-adder combination, is all that is required.
Thus, for example, the bit column 9 location of main
adder stage MS2 receives a sum and carry from main stage
MS1, but only a sum term from subarray stage SA21. No
carry term from bit column 8 of stage SA21 is generated,
so a compressor cell is not required at stage MS2,
column 9. As noted previously, compressors (C) could be
used in those locations with appropriate fixed logic zero
inputs. The connections between successive stages of the
same subarray, namely stages SA20 and SA21 of subarray CSA2
and stages SA30, SA31 and SA32 of subarray CSA3, are purely
local.

With reference to Figs. 6 and 7, the partial
products generated by the multiplier circuit depend on
the particular binary number notation and multiplication
algorithm to be used. The particular circuit shown in
Fig. 5 performs the Baugh-Wooley 2's-complement
multiplication of Fig. 7. Fig. 6 shows the multiplication of two
binary numbers in unsigned notation, i.e. an m-bit
multiplicand $[a_{m-1} a_{m-2} \ldots a_2 a_1 a_0]$ and an n-bit
multiplier $[b_{n-1} \ldots b_2 b_1 b_0]$, to form an (m+n)-bit
product $[p_{m+n-1} p_{m+n-2} p_{m+n-3} \ldots p_2 p_1 p_0]$.
The algorithm used is a
straightforward sum-of-cross-products method. The
bit-column of the partial products $(a_i b_j)$ corresponds to the
sum of the bit significances i and j, so that, for
example, partial product $(a_{m-2} b_1)$ has a bit-significance in
the final product of (m-2)+1 = (m-1) and appears in the
bit-column for $p_{m-1}$. Each column of partial products of
the same bit-significance is added, with carries being
transferred to the column of the next higher
bit-significance. In Fig. 7, the m-bit multiplicand
$[a_{m-1} a_{m-2} \ldots a_2 a_1 a_0]$ and n-bit multiplier
$[b_{n-1} \ldots b_2 b_1 b_0]$ are in
2's-complement notation. Accordingly,
$[a_{m-1} a_{m-2} \ldots a_2 a_1 a_0]$ represents the number
$-(a_{m-1})2^{m-1} + (a_{m-2})2^{m-2} + \ldots + (a_2)2^2 + (a_1)2^1 + (a_0)2^0$,
and likewise $[b_{n-1} \ldots b_2 b_1 b_0]$
represents the number
$-(b_{n-1})2^{n-1} + \ldots + (b_2)2^2 + (b_1)2^1 + (b_0)2^0$.
Note the subtraction in the most significant
bit position. The Baugh-Wooley algorithm generates
cross-products in which the most significant bit (MSB)
partial product of every row except the last row has one
input from the multiplier inverted
($\bar{b}_0, \bar{b}_1, \bar{b}_2, \ldots, \bar{b}_{n-2}$);
the partial products of the last row, except for the MSB
partial product, have one input from the multiplicand
inverted ($\bar{a}_0, \bar{a}_1, \bar{a}_2, \ldots, \bar{a}_{m-2}$);
and extra terms $a_{m-1}$, $b_{n-1}$, $\bar{a}_{m-1}$,
$\bar{b}_{n-1}$ and 1 are added at bit positions m-1, n-1, m+n-2,
m+n-2, and m+n-1, respectively. In practice, however, a
"1" is not actually added to bit position m+n-1. Instead
the carry out of half-adder 2C1 is inverted and fed into
half-adder H in bit position 33 of main stage MS3. The
carry out of half-adder 2C1 also is connected to bit
position 34 of the sum output of main stage MS3. This
implementation detail avoids having to provide a constant
value in the architecture. Again, the columns of partial
products having the same bit-significance are added, with
carries transferred to the column of the next higher
bit-significance. The result is a product which is also in
2's-complement notation. In Fig. 5, since m = n = 17,
the added terms are provided to the half-adders 2C0 and
2C1 in bit-columns 16 and 32 and to the half-adder (H) of
main stage MS3 in bit column 33.
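
The arrangement of inverted and extra terms described above
can be checked numerically. The following is a minimal
behavioral sketch of the Baugh-Wooley term generation as it is
described in this paragraph. It adds the constant 1 at bit
position m+n-1 directly rather than through the inverted carry
used in Fig. 5, and the function names are hypothetical.

```python
def baugh_wooley(a: int, b: int, m: int, n: int) -> int:
    """Return the (m+n)-bit product pattern of an m-bit and an n-bit
    two's-complement operand using only non-negative partial-product terms."""
    def abit(i: int) -> int:
        return (a >> i) & 1

    def bbit(j: int) -> int:
        return (b >> j) & 1

    total = 0
    for j in range(n):                       # one row per multiplier bit
        for i in range(m):
            ai, bj = abit(i), bbit(j)
            if i == m - 1 and j < n - 1:
                term = ai & (bj ^ 1)         # MSB of every row but the last
            elif j == n - 1 and i < m - 1:
                term = (ai ^ 1) & bj         # last row, except its MSB
            else:
                term = ai & bj
            total += term << (i + j)
    # Extra terms at bit positions m-1, n-1, m+n-2, m+n-2 and m+n-1.
    total += abit(m - 1) << (m - 1)
    total += bbit(n - 1) << (n - 1)
    total += (abit(m - 1) ^ 1) << (m + n - 2)
    total += (bbit(n - 1) ^ 1) << (m + n - 2)
    total += 1 << (m + n - 1)
    return total & ((1 << (m + n)) - 1)      # keep m+n product bits


def to_signed(value: int, bits: int) -> int:
    """Interpret a bit pattern as a two's-complement number."""
    return value - (1 << bits) if value >> (bits - 1) else value


all_match = all(to_signed(baugh_wooley(a, b, 4, 4), 8) == a * b
                for a in range(-8, 8) for b in range(-8, 8))
```

An exhaustive comparison over all 4-bit by 4-bit operand pairs
agrees with ordinary signed multiplication.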

Not shown in Fig. 3 is the final addition by a 
vector merging adder of the sum and carry words generated 
by the structure shown. This vector merging adder is 
essentially identical to any of those found in the prior
art. Several alternatives are possible: carry ripple,
carry look-ahead, carry select, etc. Also not shown is
any additional row of adders, either prior to or after
the vector merging adder, for adding the accumulator bit
values in an integrated multiplier-accumulator circuit.
Again, this is like that found in the prior art.
Finally, with respect to Figs. 1-4, it is noted that the
structure does not have to start with a row of full
adders. Whether full adders are used depends on the size
of the multiplier circuit at hand. For example, the
embodiment of the present invention shown in Fig. 5 is
a 17 x 17 multiplier, and so requires an initial row of
full adders as reflected in Figs. 3 and 4.

With reference to Figs. 8-11, various possible 
four-to-two compressor circuits are shown. These replace 
pairs of successive full-adders, but have a delay of only 
about 1.5 full adders. This reduction in delays improves 
operating speed, but necessitates extreme care when 
attempting to construct a balanced multiplier structure. 
These compressor circuits are also known as five-to-three
compressors, since there are two additional carry terms
C_in and C_out. However, since these additional carry terms
normally connect adjacent cells in the same row or stage
and are generally not received from a previous stage or
carried to a succeeding stage, they are not always
counted, hence the usual designation of four-to-two
compressor.
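
The function that such a compressor must perform can be
stated as the pair of full adders it replaces. The sketch
below is a behavioral reference only and says nothing about
the gate-level circuits of Figs. 8-11. The names are
hypothetical.

```python
from itertools import product


def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """Return (sum, carry) of three input bits."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def compressor_from_full_adders(i1, i2, i3, i4, c_in):
    """Two full adders in series: the first produces C_out for the adjacent
    cell, the second takes C_in from the adjacent cell. Returns (S, C, C_out)."""
    s1, c_out = full_adder(i1, i2, i3)     # C_out does not depend on C_in
    s, c = full_adder(s1, i4, c_in)
    return s, c, c_out


# Five input bits of one weight equal S plus two carries of the next weight.
two_fa_ok = all(
    sum(bits) == s + 2 * (c + c_out)
    for bits in product((0, 1), repeat=5)
    for s, c, c_out in [compressor_from_full_adders(*bits)]
)
```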

The compressor circuit in Fig. 8 is that taught
by G. Goto et al. in IEEE Journal of Solid-State
Circuits, vol. 27, no. 9, pages 1229-1235, September
1992. This is a symmetric compressor circuit designed
for when all four inputs I1-I4 arrive substantially at
the same time. The logic carried out by the compressor
is:

C_out = I1*I2 + I3*I4
C = -{[-(I1^I2) + -(I3^I4)] * [-(I1*I2) + -(I3*I4)]} + {C_in*(I1^I2^I3^I4)}
S = [(I1^I2)^(I3^I4)]^C_in

where -, +, ^, and * represent the logical operations
NOT, OR, XOR, and AND, respectively. In order to compare
the different circuits, we assume unit delays, with
delays of 1 unit for an inverting gate, 2 units for a
noninverting gate and 2 units for an XOR or NXOR gate.
The numbers in the figure represent the delays at the
output of each gate. To generate C_out takes 2 unit
delays. C_out is supplied to C_in in an adjacent cell of
next higher order bit-significance in the same stage. To
generate both the sum term S and the carry term C takes 6
unit delays.
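
Taking the three expressions given above to be C_out, C and S
in that order (an assumption about their labeling), the logic
can be checked exhaustively against the arithmetic it is meant
to perform. The sketch is behavioral and does not model the
gate delays. The function names are hypothetical.

```python
from itertools import product


def NOT(x: int) -> int:
    """Logical NOT of a single bit."""
    return x ^ 1


def fig8_compressor(i1, i2, i3, i4, c_in):
    """Evaluate the Fig. 8 expressions as read here. Returns (S, C, C_out)."""
    c_out = (i1 & i2) | (i3 & i4)
    c = (NOT((NOT(i1 ^ i2) | NOT(i3 ^ i4)) & (NOT(i1 & i2) | NOT(i3 & i4)))
         | (c_in & (i1 ^ i2 ^ i3 ^ i4)))
    s = ((i1 ^ i2) ^ (i3 ^ i4)) ^ c_in
    return s, c, c_out


# The count of 1s at the inputs must equal S + 2*(C + C_out) in every case.
fig8_ok = all(
    sum(bits) == s + 2 * (c + c_out)
    for bits in product((0, 1), repeat=5)
    for s, c, c_out in [fig8_compressor(*bits)]
)
```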

The circuits in Figs. 9-11 are completely new.
Several rules have been followed in devising those
circuits. The coding for the sum output S is unique. S
will always be the parity of the five input bits I1-I4
and C_in. Specifically, if the number of 1's in the five
input bits is odd, S will be 1; S will be 0 otherwise.
The coding for the carry outputs C_out and C is not unique,
providing flexibility in design. These carry outputs
represent the presence of two or more 1's in the input
pattern. If there are two or three 1's at the inputs,
there will be one and only one 1 in the carry outputs
(either C or C_out) and the other carry output will be a
zero. Any combination that follows this rule is a valid
combination that will result in correct operation of the
compressor. Another rule, which is followed for
optimization of the circuit, is to make C_out independent
of C_in. Therefore the bit assignment for C_out should be
the same for C_in equal to either 0 or 1. This is for
speed reasons, to avoid rippling through the bit
positions, because C_in comes from the bit position of next
lower significance and at the same level in the
hierarchy. The compressor of Fig. 8 is just one
particular example of these rules.
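
The coding rules can be restated as a small Python sketch.
The function names are invented here. The cases of zero or
one 1 (no carry) and of four or five 1's (both carries) are
not spelled out in the paragraph above; they are taken from
the statement that the carries represent two or more 1's and
from the wording of claim 12.

```python
def required_outputs(ones):
    """Map the number of 1s among I1-I4 and C_in to (S, carries that must be 1)."""
    if not 0 <= ones <= 5:
        raise ValueError("a five-input cell sees between 0 and 5 ones")
    return ones % 2, ones // 2


def is_valid_coding(ones, s, c, c_out):
    """True if (S, C, C_out) is an allowed output for this count of 1s."""
    want_s, want_carries = required_outputs(ones)
    return s == want_s and c + c_out == want_carries
```

For two 1's at the inputs, both (S, C, C_out) = (0, 1, 0) and
(0, 0, 1) pass the check, which is the design freedom the
rules leave open.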

In Figs. 9 and 10, the compressor logic is:

C_out = [(I1+I2)*(I3+I4)] + (I1*I2) + (I3*I4)
C = (I1*I2*I3*I4) + [C_in*(I1^I2^I3^I4)]
S = [(I1^I2)^(I3^I4)]^C_in

In the Fig. 9 implementation of this logic, generating
C_out takes 2 unit delays, while generating the sum and
carry terms S and C both take 6 unit delays. There are
equal delays from the inputs I1-I4 to the primary outputs
S and C. In other words, like the compressor of Fig. 8,
the circuit in Fig. 9 is also symmetric.
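
A minimal Python sketch of the logic shared by Figs. 9 and 10
follows. The function names are invented here, and the three
expressions are assumed to be C_out, C and S in that order.
The check shows that, under this reading, C_out is simply
"at least two of I1-I4 are 1", which makes it independent of
C_in.

```python
from itertools import product


def fig9_logic(i1, i2, i3, i4, c_in):
    """Evaluate the logic of Figs. 9 and 10; returns (S, C, C_out)."""
    parity4 = i1 ^ i2 ^ i3 ^ i4
    c_out = ((i1 | i2) & (i3 | i4)) | (i1 & i2) | (i3 & i4)
    c = (i1 & i2 & i3 & i4) | (c_in & parity4)
    s = parity4 ^ c_in
    return s, c, c_out


def c_out_is_threshold_two():
    """True if C_out == 1 exactly when two or more of I1-I4 are 1."""
    return all(
        fig9_logic(*bits)[2] == (1 if sum(bits[:4]) >= 2 else 0)
        for bits in product((0, 1), repeat=5)
    )


def fig9_is_consistent():
    """True if S + 2*(C + C_out) equals the number of 1s at the inputs."""
    return all(
        sum(bits) == s + 2 * (c + c_out)
        for bits in product((0, 1), repeat=5)
        for s, c, c_out in [fig9_logic(*bits)]
    )
```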

The compressor in Fig. 10 is an asymmetric
version. This version has shorter delay from input I1,
and secondly from input I2, than from inputs I3 and I4,
to generate C_out (and hence also C and S, which depend on
C_in from C_out of a similar adjacent circuit). Also, the
carry output C is slightly faster than the sum output S,
by 1 unit delay (5 versus 6 units). This asymmetric version
is preferred when not all inputs are available at the
same time. Thus the slowest arriving signals can be
provided on the shorter delay inputs I1 and I2, while the
sooner arriving signals can be provided to the longer
delay inputs I3 and I4. In Fig. 5, this asymmetric
compressor could be used for subarray stage SA1 in which
the product terms are generated before the arrival of
partial sums from stage SA0. In the structure of Figs. 3
and 4 in which full adder stages SA0 are put first, all
of the compressor stages SA1, SA2 and SA3 of the sub-
arrays CSA2, CSA3, CSA4 would preferably be asymmetric.
Other asymmetric circuits could be synthesized, depending
on the logic cells available to the designer.
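
The benefit of putting late signals on the short-delay inputs
can be sketched with a simple arrival-time model. Everything
numeric here is hypothetical: the text gives no per-input
delays for Fig. 10, so the pin delays below (3 and 4 units for
I1 and I2, 6 units for I3 and I4) and the arrival times are
invented purely for illustration.

```python
def output_time(arrivals, pin_delays):
    """Time at which the output settles: latest (arrival + pin delay)."""
    return max(t + d for t, d in zip(arrivals, pin_delays))


# Hypothetical input-to-output delays for I1, I2, I3, I4 (unit delays).
PIN_DELAYS = (3, 4, 6, 6)

# Late signals (t = 3, 2) on the fast inputs, early ones on the slow inputs.
late_on_fast = output_time((3, 2, 0, 0), PIN_DELAYS)

# The same four signals connected the other way round.
late_on_slow = output_time((0, 0, 2, 3), PIN_DELAYS)
```

With these made-up numbers the first assignment settles at
6 units and the second at 9, which is the effect the
asymmetric cell is meant to exploit.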

In Fig. 11, the compressor circuit implements 
the following logic: 
C_out = (I1+I2)*(I3+I4)
C = [(I1*I2)*-(I3^I4)] + [-(I1^I2)*(I3*I4)]
    + C_in*(I1^I2^I3^I4)
S = [(I1^I2)^(I3^I4)]^C_in


Like the compressors in Figs. 8 and 9, it is symmetric
with respect to the inputs I1-I4. However, like Fig. 10
it provides the carry output C slightly faster than the
sum output S by 1 unit delay (5 versus 6 units).
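
The Fig. 11 logic can be exercised in the same behavioural
way. The sketch below is not the gate-level circuit of the
figure; the function names are invented here and the three
expressions are assumed to be C_out, C and S as labelled.

```python
from itertools import product


def inv(bit):
    """Logical NOT of a single bit."""
    return bit ^ 1


def fig11_logic(i1, i2, i3, i4, c_in):
    """Evaluate the Fig. 11 equations; returns (S, C, C_out)."""
    x12, x34 = i1 ^ i2, i3 ^ i4
    c_out = (i1 | i2) & (i3 | i4)
    c = ((i1 & i2) & inv(x34)) | (inv(x12) & (i3 & i4)) | (c_in & (x12 ^ x34))
    s = (x12 ^ x34) ^ c_in
    return s, c, c_out


def fig11_is_consistent():
    """True if S + 2*(C + C_out) equals the number of 1s at the inputs."""
    return all(
        sum(bits) == s + 2 * (c + c_out)
        for bits in product((0, 1), repeat=5)
        for s, c, c_out in [fig11_logic(*bits)]
    )
```

Here C_out needs a 1 on each input pair, so two 1's on the
same pair are reported on C rather than on C_out.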
The following table summarizes the advantages
of the present invention relative to the prior art by way
of comparison. Note that delays are expressed as Full
Adder delays (FA).

Architecture           Layout     Propagation Paths         Delay Scaling  17x17 Delay
Carry Save Array       Regular    Unbalanced (Ripple)       O(N)           15 FA
Tree                   Irregular  Inherently Balanced       O(log N)       6 FA
Tree with Compressors  Irregular  Inherently Balanced       O(log N)       4.5 FA
Hekstra                Regular    Balanced by Construction                 7 FA
The Invention          Regular    Balanced by Construction  O(√N)          5.5 FA


The invention has the advantage of being both
regular in its layout and relatively fast in its
operation (5.5 full adder delays), thus combining
beneficial properties of both array architectures and
tree architectures. Another advantage is that except for
the connections between its main array stages, all
connections are local, so that only two signal tracks
need be provided in the layout no matter how large it is
scaled. This is one aspect of its regularity and hence
its small circuit area. By contrast, tree architectures
require more and more routing tracks as they are scaled
to larger sizes.

While the present invention, like the Hekstra
architecture, has balanced delays in its propagation
paths, they are not inherently balanced like tree
architectures but only balanced by construction with a
proper choice of subarray sizes. Accordingly, when the
compressor circuits of Figs. 8-11 are incorporated into
the architecture of the present invention, special care
has been required to ensure that balance is maintained.
In particular, each signal path through any of the
subarrays and through the main array has been constructed
so that it presents the same number of compressor
circuits as all other signal paths. Each successive
subarray feeding into a successive stage of the main
adder array has one more compressor than the
previous subarray. One full adder can (optionally) be
present in each subarray path, as it is in Figs. 3-5. If
the full adder heads a subarray, then any compressors in
the remainder of that subarray should be of the
asymmetric type. If the full adder is the last element
of the subarray prior to feeding into the main array,
then the first compressor circuit can be of the symmetric
type. All of the main array compressors are of the
symmetric type. With this careful construction, spurious
transitions can be minimized. (Additional delay
elements could be added where needed to handle residual
imbalance, as taught by T. Sakuta et al. in the article
referred to previously.)
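
The balance condition can be sketched as a path count in
Python. The function names and the indexing convention are
assumptions made here: subarrays 0 and 1 both feed the first
main array stage, and subarray j (for j of 2 or more) feeds
main array stage j. Optional full adders are left out, since
one per path would add the same delay to every path.

```python
def path_compressor_counts(subarray_compressors):
    """Compressors seen on the path from each subarray to the final main stage.

    subarray_compressors[j] is the number of compressors in subarray j.
    """
    main_stages = len(subarray_compressors) - 1
    counts = []
    for j, in_subarray in enumerate(subarray_compressors):
        first_main_stage = max(j, 1)                 # subarrays 0 and 1 share stage 1
        in_main = main_stages - first_main_stage + 1
        counts.append(in_subarray + in_main)
    return counts


def is_balanced(subarray_compressors):
    """True if every path presents the same number of compressors."""
    return len(set(path_compressor_counts(subarray_compressors))) == 1
```

For example, subarrays with 1, 1, 2, 3 and 4 compressors give
five compressors on every path, whereas five equal subarrays
of one compressor each do not balance.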

Also, the architecture of the present invention
can be scaled by increasing the number of main array
stages and corresponding subarrays. A 32x32 multiplier,
for example, can be implemented with four main adder
stages and no full adder stages in the subarrays (i.e.
only compressors). It has a propagation delay of only
7.5 full adders. A 61x61 multiplier can be implemented
with six main adder stages and a delay of only 11.5 full
adders (still faster than a 17x17 array architecture)
where the subarrays CSA0 and CSA1 consist of a full adder
followed by a compressor, and each successive subarray
adds one additional compressor. These constructions are
illustrated in Figs. 12 and 13, respectively, in the same
manner as Fig. 3. As a final note, it is observed that
the structure of Fig. 13 can be easily modified to
realize a 58x58 multiplier. This is accomplished by
removing the row of full adders F. The resulting 58x58
multiplier has a delay of 10.5 full adders.
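
The scaling of the compressor-only variants can be sketched
in Python under stated assumptions. The function name is
invented here. The model assumes that the first two subarrays
each hold one compressor, that each later subarray holds one
more, that a chain of c compressors sums 2c + 2 partial
product rows (four into the first compressor, two more into
each later one), and that a compressor costs 1.5 full adder
delays. Designs that include full adder rows, such as the
61x61 example, are not modelled.

```python
COMPRESSOR_DELAY_FA = 1.5  # "about 1.5 full adders" per compressor


def compressor_only_multiplier(main_stages, first_subarray_compressors=1):
    """Return (partial product rows, delay in full adder units)."""
    rows = 0
    for j in range(main_stages + 1):        # subarrays 0 .. main_stages
        compressors = first_subarray_compressors + max(j, 1) - 1
        rows += 2 * compressors + 2         # rows absorbed by this subarray
    path_compressors = first_subarray_compressors + main_stages
    return rows, path_compressors * COMPRESSOR_DELAY_FA
```

With four main stages the sketch returns (32, 7.5), and with
six main stages it returns (58, 10.5), in line with the
compressor-only figures quoted in the text.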



Claims

1. A multiplication circuit, comprising: 

means, receiving an M-bit multiplicand and an
N-bit multiplier, for forming N M-bit partial products,
where M and N are integers greater than 8, each bit of
each partial product having a bit-significance
corresponding to a specified bit of an (M+N)-bit product; and

addition means for summing said N M-bit partial
products such that bits of said partial products having
the same bit-significance are added together, wherein
said addition means is organized into an architecture
that is characterized by a plurality of subarrays forming
partial sums and a multistage main array adding said
partial sums, said architecture having an asymmetric but
delay-balanced branching architecture in which a first
main array stage receives partial sums from two subarrays
and each subsequent main array stage receives partial
sums from one previous main array stage and only one
corresponding subarray, the subarray for each subsequent
main array being successively larger than subarrays for
previous main arrays to maintain balanced propagation
delays for partial sums provided to each main array
stage, at least one subarray including a four-to-two
compressor circuit therein, and

a vector merging adder receiving a multibit sum
word and a multibit carry word together representing a
partial sum from a final main array stage of said
addition means, said vector merging adder summing said
word and carry word to produce said (M+N)-bit product.



2. The multiplication circuit of claim 1 wherein each
signal propagation path from a first stage of a subarray
through each stage of that subarray to a stage of said
main array and through subsequent stages of said main
array has an identical number of compressor circuits
compared to all other signal propagation paths.

3. The multiplication circuit of claim 1 wherein each
cell of a subarray stage and each cell of a main array
stage that receives a total of four partial product
inputs and generates a sum term and a carry term
comprises a compressor circuit.



4. The multiplication circuit of claim 1 wherein each
cell of a subarray stage and each cell of a main array
stage that receives a total of three partial product
inputs and generates a sum term and a carry term
comprises a full adder and a half adder in sequence.

5. The multiplication circuit of claim 1 wherein said
multiplicand and multiplier are in unsigned binary
notation, said means for forming partial products
generating cross-products of said M-bit multiplicand with
said N bits of said multiplier.



6. The multiplication circuit of claim 1 wherein said
multiplicand and said multiplier are in two's-complement
notation, said means for forming partial products
generating cross-products in accord with Baugh-Wooley's
algorithm.

7. The multiplication circuit of claim 1 wherein said
addition means is laid out linearly with said first main
array stage following said two subarrays from which that
first main array stage receives partial sums, all stages
of any subarray being grouped together, and each main
array stage subsequent to said first main array stage
following said stages of the subarray corresponding to
said main array stage, whereby all signal propagation
paths are local except paths between successive main
array stages, and whereby each subarray stage requires
tracks for only two crossing signal propagation paths.



8. The multiplication circuit of claim 1, wherein at
least one of said compressor circuits comprises:

a first signal input, a second signal input, a
third signal input, a fourth signal input, and a carry
input;

a first logic gate consisting of a two-input
NAND gate, said two inputs of said NAND gate connected to
said first and second signal inputs;

a second logic gate consisting of a two-input
NAND gate, said two inputs of said NAND gate connected to
said third and fourth signal inputs;

a third logic gate consisting of a two-input OR
gate, said two inputs of said OR gate being inverted
inputs and connected to outputs of said first and second
logic gates, said third logic gate providing a first
carry output;

a fourth logic gate consisting of a two-input
OR gate feeding into one input of a two-input NAND gate,
a second input of said NAND gate connected to said output
of said first logic gate, said two inputs of said OR gate
connected to said first and second signal inputs;

a fifth logic gate consisting of a two-input OR
gate feeding into one input of a two-input NAND gate, a
second input of said NAND gate connected to said output
of said second logic gate, said two inputs of said OR
gate connected to said third and fourth signal inputs;

a sixth logic gate consisting of first and
second two-input OR gates feeding into respective inputs
of a two-input NAND gate, said two inputs of said first
OR gate connected to said outputs of said first and
second logic gates, said two inputs of said second OR
gate connected to outputs of said fourth and fifth logic
gates;

a seventh logic gate consisting of a two-input
XOR gate, said two inputs of said XOR gate connected to
said outputs of said fourth and fifth logic gates;

an eighth logic gate consisting of a two-input
AND gate feeding into one input of a two-input OR gate, a
second input of said OR gate connected to an output of
said sixth logic gate, said two inputs of said AND gate
connected to said carry input and an output of said
seventh logic gate, said eighth logic gate providing a
second carry output; and

a ninth logic gate consisting of a two-input
XOR gate, said two inputs of said XOR gate connected to
said carry input and said output of said seventh logic
gate, said ninth logic gate providing a sum output.



9. The multiplication circuit of claim 1, wherein at
least one of said compressor circuits comprises:

a first signal input, a second signal input, a
third signal input, a fourth signal input, and a carry
input;

a first logic gate consisting of a two-input
NOR gate, said two inputs of said NOR gate connected to
said first and second signal inputs;

a second logic gate consisting of a two-input
NOR gate, said two inputs of said NOR gate connected to
said third and fourth signal inputs;

a third logic gate consisting of a two-input
NAND gate, said two inputs of said NAND gate connected to
said first and second signal inputs;

a fourth logic gate consisting of a two-input
NAND gate, said two inputs of said NAND gate connected to
said third and fourth signal inputs;

a fifth logic gate consisting of a two-input
NOR gate, said two inputs of said NOR gate connected to
outputs of said first and second logic gates;

a sixth logic gate consisting of a two-input
NAND gate, said two inputs of said NAND gate connected to
outputs of said third and fourth logic gates;

a seventh logic gate consisting of a two-input
NOR gate, said two inputs of said NOR gate connected to
outputs of said fifth and sixth logic gates, said seventh
logic gate providing a first carry output;

an eighth logic gate consisting of a two-input
NOR gate, said two inputs of said NOR gate connected to
said outputs of said third and fourth logic gates;

a ninth logic gate consisting of a two-input OR
gate feeding into one input of a two-input NAND gate, a
second input of said NAND gate connected to said output
of said third logic gate, said two inputs of said OR gate
connected to said first and second signal inputs;

a tenth logic gate consisting of a two-input OR
gate feeding into one input of a two-input NAND gate, a
second input of said NAND gate connected to said output
of said fourth logic gate, said two inputs of said OR
gate connected to said third and fourth signal inputs;

an eleventh logic gate consisting of a two-input
XOR gate, said two inputs of said XOR gate
connected to outputs of said ninth and tenth logic gates;

a twelfth logic gate consisting of a two-input
AND gate feeding into one input of a two-input OR gate, a
second input of said OR gate connected to an output of
said eighth logic gate, said two inputs of said AND gate
connected to said carry input and an output of said
eleventh logic gate, said twelfth logic gate providing a
second carry output; and

a thirteenth logic gate consisting of a two-input
XOR gate, said two inputs of said XOR gate
connected to said carry input and said output of said
eleventh logic gate, said thirteenth logic gate providing
a sum output.




10. The multiplication circuit of claim 1, wherein at
least one of said compressor circuits comprises:

a first signal input, a second signal input, a
third signal input, a fourth signal input, and a carry
input;

a first logic gate consisting of a three-input
OR gate feeding into one input of a two-input NAND gate,
a second input of said NAND gate connected to said first
signal input, said three inputs of said OR gate connected
to said second, third and fourth signal inputs;

a second logic gate consisting of a two-input
OR gate feeding into one input of a two-input NAND gate,
a second input of said NAND gate connected to said second
signal input, said two inputs of said OR gate connected
to said third and fourth signal inputs;

a third logic gate consisting of a two-input
NAND gate, said two inputs of said NAND gate connected to
said third and fourth signal inputs;

a fourth logic gate consisting of a three-input
NAND gate, said three inputs of said NAND gate connected
to outputs of said first, second and third logic gates,
said fourth logic gate providing a first carry output;

a fifth logic gate consisting of a four-input
NAND gate, said four inputs of said NAND gate connected
to said first, second, third and fourth signal inputs;

a sixth logic gate consisting of a two-input
XOR gate, said two inputs of said XOR gate connected to
said first and second signal inputs;

a seventh logic gate consisting of a two-input
XOR gate, said two inputs of said XOR gate connected to
said third and fourth signal inputs;

an eighth logic gate consisting of a two-input
XNOR gate, said two inputs of said XNOR gate connected to
outputs of said sixth and seventh logic gates;

an inverter connected to said carry input;

a ninth logic gate consisting of a two-input OR
gate feeding into one input of a two-input NAND gate, a
second input of said NAND gate connected to an output of
said fifth logic gate, said two inputs of said OR gate
connected to outputs of said eighth logic gate and said
inverter, said ninth logic gate providing a second carry
output; and

a tenth logic gate consisting of a two-input
XOR gate, said two inputs of said XOR gate connected to
said outputs of said eighth logic gate and said inverter,
said tenth logic gate providing a sum output.



11. The multiplication circuit of claim 1, wherein at
least one of said compressor circuits comprises:

a first signal input, a second signal input, a
third signal input, a fourth signal input, and a carry
input;

a first logic gate consisting of a two-input
NOR gate, said two inputs of said NOR gate connected to
said first and second signal inputs;

a second logic gate consisting of a two-input
NOR gate, said two inputs of said NOR gate connected to
said third and fourth signal inputs;

a third logic gate consisting of a two-input
NOR gate, said two inputs of said NOR gate connected to
outputs of said first and second logic gates, said third
logic gate providing a first carry output;

a fourth logic gate consisting of a two-input
XNOR gate, said two inputs of said XNOR gate connected to
said first and second signal inputs;

a fifth logic gate consisting of a two-input
XNOR gate, said two inputs of said XNOR gate connected to
said third and fourth signal inputs;

a sixth logic gate consisting of a three-input
NAND gate, said three inputs of said NAND gate connected
to said first and second signal inputs and an output of
said fifth logic gate;

a seventh logic gate consisting of a three-input
NAND gate, said three inputs of said NAND gate
connected to said third and fourth signal inputs and an
output of said fourth logic gate;

an eighth logic gate consisting of a two-input
XNOR gate, said two inputs of said XNOR gate connected to
said outputs of said fourth and fifth logic gates;

an inverter connected to said carry input;

a ninth logic gate consisting of a two-input OR
gate feeding into one input of a three-input NAND gate,
second and third inputs of said NAND gate connected to
outputs of said sixth and seventh logic gates, said two
inputs of said OR gate connected to outputs of said
eighth logic gate and said inverter, said ninth logic
gate providing a second carry output; and

a tenth logic gate consisting of a two-input
XOR gate, said two inputs of said XOR gate connected to
said outputs of said eighth logic gate and said inverter,
said tenth logic gate providing a sum output.

12. The multiplication circuit of claim 1, wherein at
least one of said compressor circuits comprises:

a plurality of inputs, including a first signal
input, a second signal input, a third signal input, a
fourth signal input, and a carry input; and

a plurality of outputs, including a first carry
output, a second carry output, and a sum output;

said at least one of said compressor circuits
being characterized in that said sum output is set to 1
if the number of 1's in said plurality of inputs is odd,
said sum output being set to 0 otherwise;

said at least one of said compressor circuits
further being characterized in that one and only one of
said first and second carry outputs is set to 1 if the
number of 1's in said plurality of inputs is 2 or 3;

said at least one of said compressor circuits
further being characterized in that both of said first
and second carry outputs are set to 1 if the number of
1's in said plurality of inputs is 4 or 5.

13. The multiplication circuit of claim 12 wherein said
at least one of said compressor circuits is further
characterized in that one of said carry outputs is
determined independently of said carry input.



14. A multiplication circuit, comprising: 

means, receiving an M-bit multiplicand and an
N-bit multiplier, for forming partial product terms
therefrom, each partial product term corresponding to a
specified bit of an (M+N)-bit product; and,

for each product bit, addition means for adding
all partial product terms that correspond to that product
bit plus any carry terms generated by the addition means
for the next less significant product bit, each said
addition means generating a sum forming said product bit
and one or more carry terms to be transferred to the
addition means for the next greater significant product
bit,

wherein each said addition means is organized
into an architecture that is characterized by a plurality
of adding stages forming partial sums, the adding stages
being organized into a plurality of chains of successive
subarray adders and a single chain of successive main
array adders, a first stage in said chain of main array
adders being an adder connected to two chains of subarray
adders to receive partial sums therefrom, each stage of
said chain of main array adders subsequent to said first
stage being connected to a preceding stage of said main
array adder chain and to one and only one chain of
subarray adders,

wherein each adding stage in said chain of main
array adders being a four-to-two compression adder
circuit, hereafter called a 'compressor', said two chains
of subarray adders connected to said first stage of said
main array being identical in the number of each type of
adder in those chains, each chain of subarray adders
connected to subsequent stages of said main array being
identical to a chain of subarray adders connected to a
preceding stage of said main array in the number of each
type of adder in that chain except for having one more
compressor than said preceding chain, whereby each signal
propagation path through said chains of subarray adders
and through said main array has a balanced delay, and

subsequent to said addition means, a vector
merging adder receiving a multibit sum word and a
multibit carry word from the addition means for each
product bit, said vector merging adder summing
corresponding bits of the same bit significance of said sum
word and said carry word to form said (M+N)-bit product.



15. The multiplication circuit of claim 13 further 
comprising a row of accumulator adders for at least each 
bit of said product. 

16. The multiplication circuit of claim 15 wherein said
accumulator adders are located between said addition
means and said vector merging adder.

17. The multiplication circuit of claim 14 wherein said
multiplicand and multiplier are in unsigned binary
notation, said means for forming partial product terms
generating MxN cross-products from said M bits of said
multiplicand and said N bits of said multiplier.



18. The multiplication circuit of claim 14 wherein said
multiplicand and multiplier are in two's-complement
notation, said means for forming partial product terms
generating said terms in accord with the Baugh-Wooley
algorithm.



19. The multiplication circuit of claim 14 wherein
compressors in stages of said chain of subarray adders
other than a first stage are asymmetric compressors in
which two inputs to said compressors propagate slower
than two other inputs to sum and carry outputs of said
compressors.



20. The multiplication circuit of claim 14 wherein said
compressors in said main adder array and any compressors
in a first stage of any chain of subarray adders are
symmetric compressors in which four inputs to said
compressors propagate essentially equal in speed to sum
and carry outputs of said compressors.

AMENDED CLAIMS

[received by the International Bureau on 6 April 1999 (06.04.99); 
original claims 1, 2 and 14 amended; remaining claims unchanged (4 pages)] 



1. A multiplication circuit, comprising: 

means, receiving an M-bit multiplicand and an
N-bit multiplier, for forming N M-bit partial products,
where M and N are integers greater than 8, each bit of
each partial product having a bit-significance
corresponding to a specified bit of an (M+N)-bit product; and

addition means for summing said N M-bit partial
products such that bits of said partial products having
the same bit-significance are added together, wherein
said addition means is organized into an architecture
that is characterized by a plurality of subarrays forming
partial sums and a multistage main array adding said
partial sums, said architecture having an asymmetric but
non-inherently delay-balanced branching architecture in
which a first main array stage receives partial sums from
two subarrays and each subsequent main array stage
receives partial sums from one previous main array stage
and only one corresponding subarray, the subarray for
each subsequent main array stage being successively
larger than subarrays for previous main array stages to
maintain balanced propagation delays for partial sums
provided to each main array stage, at least one subarray
including a four-to-two compressor circuit therein, and

a vector merging adder receiving a multibit sum
word and a multibit carry word together representing a
partial sum from a final main array stage of said
addition means, said vector merging adder summing said
sum word and carry word to produce said (M+N)-bit
product.

2. The multiplication circuit of claim 1 wherein each
signal propagation path from a first stage of a subarray
through subsequent stages of said subarray to a stage of
said main array and through subsequent stages of said
main array has an identical number of compressor circuits
compared to all other signal propagation paths.

AMENDED SHEET (ARTICLE 19) 

12. The multiplication circuit of claim 1, wherein at
least one of said compressor circuits comprises:

a plurality of inputs, including a first signal
input, a second signal input, a third signal input, a
fourth signal input, and a carry input; and

a plurality of outputs, including a first carry
output, a second carry output, and a sum output;

said at least one of said compressor circuits
being characterized in that said sum output is set to 1
if the number of 1's in said plurality of inputs is odd,
said sum output being set to 0 otherwise;

said at least one of said compressor circuits
further being characterized in that one and only one of
said first and second carry outputs is set to 1 if the
number of 1's in said plurality of inputs is 2 or 3;

said at least one of said compressor circuits
further being characterized in that both of said first
and second carry outputs are set to 1 if the number of
1's in said plurality of inputs is 4 or 5.

13. The multiplication circuit of claim 12 wherein said 
at least one of said compressor circuits is further 
characterized in that one of said carry outputs is 
determined independently of said carry input. 



14. A multiplication circuit, comprising: 

means, receiving an M-bit multiplicand and an 
N-bit multiplier, for forming partial product terms 
therefrom, each partial product term corresponding to a 
specified bit of an (M+N)-bit product; and, 

for each product bit, addition means for adding 
all partial product terms that correspond to that product 
bit plus any carry terms generated by the addition means 
for the next less significant product bit, each said 
addition means generating a sum forming said product bit 
and one or more carry terms to be transferred to the
addition means for the next greater significant product
bit, 

wherein each said addition means is organized 
into an asymmetric, non-inherently delay balanced 
architecture that is characterized by a plurality of 
adding stages forming partial sums, the adding stages 
being organized into a plurality of chains of successive 
subarray adders and a single chain of successive main 
array adders, a first stage in said chain of main array 
adders being an adder connected to two chains of subarray 
adders to receive partial sums therefrom, each stage of 
said chain of main array adders subsequent to said first 
stage being connected to a preceding stage of said main 
array adder chain and to one and only one chain of 
subarray adders, 

wherein each adding stage in said chain of main 
array adders being a four-to-two compression adder 
circuit, hereafter called a 'compressor', each compressor 
having a delay being less than a delay associated with a 
pair of successive full adders, said two chains of 
subarray adders connected to said first stage of said 
main array being identical in the number of each type of 
adder in those chains, each chain of subarray adders 
connected to subsequent stages of said main array being 
identical to a chain of subarray adders connected to a 
preceding stage of said main array in the number of each 
type of adder in that chain except for having one more 
compressor than said preceding chain, whereby each signal 
propagation path through said chains of subarray adders 
and through said main array has a balanced delay, and 

subsequent to said addition means, a vector 
merging adder receiving a multibit sum word and a 
multibit carry word from the addition means for each 
product bit, said vector merging adder summing
corresponding bits of the same bit significance of said sum
word and said carry word to form said (M+N)-bit product. 

15. The multiplication circuit of claim 13 further
comprising a row of accumulator adders for at least each 
bit of said product. 

16. The multiplication circuit of claim 15 wherein said 
accumulator adders are located between said addition 
means and said vector merging adder. 

17. The multiplication circuit of claim 14 wherein said 
multiplicand and multiplier are in unsigned binary 
notation, said means for forming partial product terms 
generating MxN cross-products from said M bits of said 
multiplicand and said N bits of said multiplier. 

18. The multiplication circuit of claim 14 wherein said 
multiplicand and multiplier are in two's-complement
notation, said means for forming partial product terms 
generating said terms in accord with the Baugh-Wooley 
algorithm. 



19. The multiplication circuit of claim 14 wherein
compressors in stages of said chain of subarray adders
other than a first stage are asymmetric compressors in
which two inputs to said compressors propagate slower
than two other inputs to sum and carry outputs of said
compressors.

20. The multiplication circuit of claim 14 wherein said 
compressors in said main adder array and any compressors 
in a first stage of any chain of subarray adders are
symmetric compressors in which four inputs to said 
compressors propagate essentially equal in speed to sum 
and carry outputs of said compressors. 
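
The three parts recited in claim 14, for the unsigned case of claim 17, can be sketched behaviourally as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the reduction step uses plain full adders applied column by column rather than the claimed chains of four-to-two compressors, so it models the arithmetic and not the claimed arrangement or its timing.

```python
def partial_products(x, y, m, n):
    """Form the M x N cross-products, grouped by bit significance.

    cols[k] holds every AND term x_i & y_j with i + j == k.
    """
    cols = [[] for _ in range(m + n)]
    for i in range(m):
        for j in range(n):
            cols[i + j].append(((x >> i) & 1) & ((y >> j) & 1))
    return cols


def reduce_to_sum_and_carry(cols):
    """Reduce each column to at most two bits using full adders.

    Assumption: generic carry-save reduction, not the patent's
    compressor chains. Returns (sum_word, carry_word).
    """
    cols = [list(c) for c in cols]
    while any(len(c) > 2 for c in cols):
        nxt = [[] for _ in cols]
        for k, c in enumerate(cols):
            i = 0
            while len(c) - i >= 3:
                a, b, d = c[i:i + 3]
                nxt[k].append(a ^ b ^ d)  # sum stays in column k
                if k + 1 < len(cols):     # carry moves up one column
                    nxt[k + 1].append((a & b) | (a & d) | (b & d))
                i += 3
            nxt[k].extend(c[i:])          # leftover bits pass through
        cols = nxt
    sum_word = sum((c[0] if len(c) > 0 else 0) << k for k, c in enumerate(cols))
    carry_word = sum((c[1] if len(c) > 1 else 0) << k for k, c in enumerate(cols))
    return sum_word, carry_word


def vector_merge(sum_word, carry_word):
    """Final addition of the sum and carry words to give the product."""
    return sum_word + carry_word
```

For example, `vector_merge(*reduce_to_sum_and_carry(partial_products(7, 7, 3, 3)))` reproduces the product of the two 3-bit operands.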

STATEMENT UNDER ARTICLE 19



Applicant is amending independent claims 1 and 14 to point
out that the architecture of the present invention is
"asymmetric" and "non-inherently delay balanced". Unlike the
claims as now amended, the cited references of Goto et al. and
Galbi et al. do not teach the need to balance the multiplier 
structure, because the tree architectures in which they are used 
are symmetric and already inherently balanced. 

The present invention uses "four-to-two compressor circuits" 
in an architecture that is otherwise similar to the architecture 
disclosed in the cited Hekstra et al. reference. The difference
between the two architectures, as set forth in amended claims 1 
and 14, is that the Hekstra et al. reference uses two rows of 
full adders, rather than having a four-to-two compressor circuit 
in "at least one of the subarrays". Applicant asserts that a 
four-to-two compression adder is not the same as two rows of full 
adders arranged in a four-to-two reducer configuration. Due to 
the structural differences between the circuits, the time delay 
through the two rows of full adders shown in Hekstra et al. is
longer than the time delay through the compressor architecture of 
the present invention. The cited Mou et al. reference is another 
asymmetric architecture which uses pairs of full adders in a 
reducer configuration. Again, the pair of full adders has a 
longer delay time than the four-to-two compressor claimed in the 
present invention. Applicant has amended claim 14 to further 
make this distinction, specifying that "each compressor having a 
delay being less than a delay associated with a pair of 
successive full adders".
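
The distinction drawn above between a four-to-two compressor and two full adders in series can be illustrated with a small behavioural model. This is a sketch under assumptions: `compressor_4to2` follows a commonly described XOR/multiplexer form with a horizontal carry input and carry output, which is not necessarily the circuit of claims 8-11, and all function names are hypothetical. The two models produce identical outputs; the difference lies only in the internal structure and therefore in the delay.

```python
from itertools import product


def full_adder(a, b, c):
    """Three-input adder cell: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def two_full_adders(a, b, c, d, cin):
    """Reference 4-to-2 reducer: two full adders in series."""
    s1, cout = full_adder(a, b, c)
    s, carry = full_adder(s1, d, cin)
    return s, carry, cout


def compressor_4to2(a, b, c, d, cin):
    """Assumed XOR/multiplexer form of a 4-to-2 compressor.

    Returns (sum, carry, cout); carry and cout both have twice
    the weight of sum.
    """
    x1 = a ^ b
    x2 = c ^ d              # computed in parallel with x1
    x3 = x1 ^ x2
    s = x3 ^ cin
    cout = c if x1 else a   # multiplexer selected by x1
    carry = cin if x3 else d  # multiplexer selected by x3
    return s, carry, cout


def compressor_is_valid():
    """Check all 32 input patterns against the arithmetic identity
    a + b + c + d + cin == sum + 2 * (carry + cout) and against the
    two-full-adder reference."""
    for bits in product((0, 1), repeat=5):
        s, carry, cout = compressor_4to2(*bits)
        if sum(bits) != s + 2 * (carry + cout):
            return False
        if (s, carry, cout) != two_full_adders(*bits):
            return False
    return True
```

In this assumed form the longest path to the sum output passes through three XOR levels (`x1`, `x3`, `s`), whereas two series full adders built from two XOR levels each present four. That gate count is an illustrative unit-delay estimate, not a figure taken from the application.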

When the compressor circuits, such as those claimed 
specifically in claims 8-11, are incorporated into a Hekstra type 
architecture, as set forth in the amended claims 1 and 14, 
special care has been required to ensure that balance is 
maintained. Each signal path through any of the subarrays and 
through the main array has been constructed so that it presents 
the same number of compressor circuits as all other signal paths. 
Each successive subarray feeding into a successive stage of the 
main adder array has one more compressor than the previous
subarray. With careful construction of the compressors, spurious
transitions and rippling effects are minimized and the
propagation paths are balanced by construction. Additionally, 
the use of four-to-two compressors, instead of pairs of full 
adders, provides the multiplier architecture with an improved 
operating speed. 
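
The balancing rule described above (each successive subarray chain has one more compressor than the previous one) can be checked with a simple arrival-time model. This is a minimal sketch under assumptions: every compressor is given the same unit delay, any non-compressor adders common to all chains are lumped into a single `t_common` term, and the function and parameter names are hypothetical.

```python
def input_arrival_times(num_main_stages, first_chain_compressors=1,
                        extra_per_stage=1, t_compressor=1.0, t_common=0.0):
    """Arrival times of the two operand groups at each main array stage.

    Stage 1 is fed by two identical subarray chains. Every later stage
    is fed by the preceding main stage and by one subarray chain that
    holds extra_per_stage more compressors than the chain before it.
    Returns a list of (t_first_input, t_second_input), one per stage.
    """
    pairs = []
    t_first = t_common + first_chain_compressors * t_compressor
    pairs.append((t_first, t_first))
    t_main_out = t_first + t_compressor
    for stage in range(2, num_main_stages + 1):
        n = first_chain_compressors + (stage - 1) * extra_per_stage
        t_sub = t_common + n * t_compressor
        pairs.append((t_main_out, t_sub))
        # A stage can only settle after its later input has arrived.
        t_main_out = max(t_main_out, t_sub) + t_compressor
    return pairs


def is_balanced(pairs, tol=1e-9):
    """True when both inputs of every main stage arrive together."""
    return all(abs(a - b) <= tol for a, b in pairs)
```

With the default `extra_per_stage=1` the two arrival times coincide at every stage. Setting `extra_per_stage=0` (all chains of equal length) makes the subarray input arrive one compressor delay early at the second stage, which is the kind of skew that gives rise to spurious transitions.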

Applicant has also made minor amendments to claims 1 and 2 
to provide the proper antecedent basis for these claims.